How do I properly size CPU & memory for my workloads?

Objective

The goal of this document is to help you find the right CPU, memory, and GPU resources for running AI workloads.

It can be challenging to determine the right amount of CPU, memory, and GPU resources for a given AI workload.

Precision, batch size, model size, and context length are all tightly coupled to how much of each resource (especially GPU memory) is needed.
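
For transformer LLMs, a common back-of-the-envelope estimate is: the weights take roughly (parameter count × bytes per parameter), and the KV cache grows linearly with batch size and context length. Below is a minimal sketch of that arithmetic; the 10% overhead factor is an assumption, and the layer/head figures in the example correspond to a Llama-2-70B-style architecture:

```python
def estimate_vram_gb(params_b, bytes_per_param, n_layers, n_kv_heads,
                     head_dim, context_len, batch_size, overhead=1.1):
    """Rough VRAM estimate: model weights plus KV cache, padded by an
    overhead factor for activations and runtime buffers (assumed 10%)."""
    weights = params_b * 1e9 * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, per cached token
    kv_cache = (2 * n_layers * n_kv_heads * head_dim
                * context_len * batch_size * bytes_per_param)
    return (weights + kv_cache) * overhead / 1e9

# 70B-parameter model at FP16 (2 bytes/param), 4k context, batch of 1
print(estimate_vram_gb(params_b=70, bytes_per_param=2, n_layers=80,
                       n_kv_heads=8, head_dim=128, context_len=4096,
                       batch_size=1))  # ~155 GB -> the 2x H100 (80 GB) row
```

Halving the precision (8-bit quantization) roughly halves both terms, which is why the quantized 70B row below fits on a single 80 GB GPU.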

Recommended Values:

| AI Workload | CPU Cores | Memory (GB) | GPU Count | VRAM per GPU (GB) | Notes |
|---|---|---|---|---|---|
| Llama-7B | 8–16 | 32–64 | 1 | 16+ | Single GPU sufficient; fits L40, L40S, H100, H200 |
| Llama-70B (FP16) | 16–32 | 128–256 | 2 (H100) | 80 | Or 1× H200 (141 GB), or 3× L40 (48 GB each) |
| Llama-70B (quantized, 8-bit) | 16–32 | 128–256 | 1 (H100) | 80 | Or 2× L40 (48 GB each); depends on batch size |
| vLLM – Inference Server | 16–32 | 64–128 | Model-dependent | Model-dependent | See model requirements; e.g., 2× H100 for 70B |
| NVIDIA NIM | 16–32 | 64–128 | Model-dependent | Model-dependent | e.g., 2× H100 |
| Infinity Server (Embeddings) | 8–16 | 32–64 | 1 | 8–16 | Fits L40, L40S, H100, H200; often overprovisioned |
| Invoke (Image Generation) | 8–16 | 32–64 | 1 | 8+ | Preferably 16 GB; fits all specified GPUs |
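
To sanity-check the GPU counts above, divide the weight footprint by per-GPU VRAM and round up. The sketch below reproduces the table's rows from weight size alone; real deployments should keep headroom for KV cache and runtime overhead:

```python
import math

def min_gpus(total_vram_gb, vram_per_gpu_gb):
    """Smallest GPU count whose combined VRAM covers the estimate."""
    return math.ceil(total_vram_gb / vram_per_gpu_gb)

# Weights-only checks (params x bytes/param); leave headroom for KV cache:
print(min_gpus(7 * 2, 16))    # Llama-7B FP16   -> 1 GPU with 16 GB+
print(min_gpus(70 * 2, 80))   # Llama-70B FP16  -> 2x H100 (80 GB each)
print(min_gpus(70 * 1, 80))   # Llama-70B int8  -> 1x H100
print(min_gpus(70 * 2, 48))   # Llama-70B FP16  -> 3x L40 (48 GB each)
```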